Abstract

Artificial Neural Networks (ANN) comprise important symmetry properties, which can influence the
performance of Monte Carlo methods in Neuroevolution. The problem of the symmetries is also known as
the competing conventions problem or simply as the permutation problem. In the literature, symmetries
are mainly addressed in Genetic Algorithm based approaches. However, investigations in this direction
based on other Evolutionary Algorithms (EA) are rare or missing. Furthermore, there are differing and
contradictory reports on the efficacy of symmetry breaking. By using a novel viewpoint, we offer
a possible explanation for this issue. As a result, we show that a strategy which is invariant to the
global optimum can only be successful on certain problems, whereas it must fail to improve the global
convergence on others. We introduce the Minimum Global Optimum Proximity principle as a generalized
and adaptive strategy for symmetry breaking, which depends on the location of the global optimum. We
apply the proposed principle to Differential Evolution (DE) and Covariance Matrix Adaptation Evolution
Strategies (CMA-ES), which are two popular and conceptually different global optimization methods.
Using a wide range of feedforward ANN problems, we experimentally illustrate significant improvements
in the global search efficiency by the proposed symmetry breaking technique.

1 Introduction

Artificial Neural Networks (ANN) are general function approximators [13] and can be used to find a functional
representation of a data set. Another point of view is that ANNs represent a way of data compression [2].
The compression ratio depends on the number of neurons used in the ANN which encodes the data: the fewer
neurons at the same representation quality, the better the compression.

Given a problem, there are generally two kinds of optimization tasks for the learning process of ANNs.
The first one is to find a network topology, i.e., the optimal number of layers and the optimal number of
neurons per layer. The second task is to find the parameters of the network, given a topology. In this paper,
we focus on the second task and assume a predefined topology.

The estimation of the ANN-parameters is generally a computationally demanding task [28]. The
corresponding Maximum-Likelihood derived error function comprises many local optima. Therefore, local search
techniques generally fail to find an optimal solution and typically converge to a suboptimal solution [13]. In
addition, local search techniques are mainly sequential methods and parallel implementations are limited.
On the other hand, global optimization techniques based on Monte Carlo methods such as the Genetic
Algorithm (GA) [7, 21], Covariance Matrix Adaptation Evolution Strategies (CMA-ES) [12, 11] or Differential
Evolution (DE) [29, 22, 34] are generally very well parallelizable. Differential Evolution is one of the most
popular and robust Monte Carlo global search methods, which outperforms many other evolutionary
algorithms on a wide range of problems [3, 33, 36]. DE is successfully used in various engineering problems such
as multiprocessor synthesis [23], optimization of radio network designs [20], training Radial Basis Function
networks [18], training multi-layer neural networks [15] and many others [5]. On the other hand, CMA-ES is a
state-of-the-art evolutionary algorithm, which is also used for ANN-learning [27, 26, 8] and other engineering
tasks [24, 16, 25].

Due to inherent symmetries in the parametric representation of ANNs, there are multiple global
optima in the parameter space. The multiple global optima result from point symmetries and permutation
symmetries [30, 31]. In the literature, this problem is also known as the competing conventions problem, or
simply the permutation problem. In [32, 31], significant improvements are reported by different approaches
to symmetry breaking for GAs. However, in each of these publications, the improvement is shown using only
a single test case. On the other hand, contradictory results are presented in [10, 9], where the
effect of removing these symmetries on GAs is reported to be minimal and negligible, and even to lead to
reduced performance.






Furthermore, crossover operators used in GAs are reported to be a source of the problems caused by
symmetries [6]. Therefore, some researchers disable crossover or apply EAs which do not have crossover at
all [37].

To the best of our knowledge, there are no reports on the impact of the ANN-symmetries on the
performance of the DE and CMA-ES methods. In this paper, we show that the performance of DE and CMA-ES
is highly sensitive to the presence of multiple global optima, and that symmetries also affect the
performance of EAs without crossover operators. We show that there are infinitely many ways of symmetry
breaking, which differ in the way they partition the parameter space. Furthermore, we argue that an
effective way of partitioning should depend on the location of the global optimum and its symmetric replicas.
Therefore, we derive a symmetry breaking operator based on considerations about the partitioning of the
ANN-parameter space, which is optimal according to a Minimum Global Optimum Proximity condition. By
theoretical considerations and numerous experimental studies on offline supervised learning problems, we
show that typical approaches to symmetry breaking, which are invariant to the global optimum, may lead to
superior or inferior results, depending on the ANN-problem.

On the other hand, we show that the proposed global optimum variant approach for symmetry breaking
leads to consistent and significant improvements in the estimation of ANN-parameters.

The paper is organized as follows. In Section 2, we briefly review Artificial Feedforward Neural
Networks (ANN). Section 3 defines the term 'symmetry' and introduces the types of symmetries found in the
optimization of ANN-parameters. In Section 4, we discuss existing approaches to symmetry breaking. In that
Section, we also reformulate the rules applied by existing approaches to prepare a more general view of the
topic. In Section 5, we introduce the 'Minimum global optimum proximity' principle and propose symmetry
breaking methods based on this principle. In Section 6, we present the conducted experiments and obtained
results, followed by the Conclusions, where the main contributions are emphasized.

2 Brief review of Artificial Feedforward Neural Networks 

Artificial (Feedforward) Neural Networks (ANN) are used for approximation of functions $f: \mathbb{R}^p \to \mathbb{R}^q$.
ANNs typically have multiple layers of artificial neurons. Assuming that an ANN has $L$ layers, the first and
the last layer are called the input and the output layer, respectively. The remaining $L - 2$ layers are called
hidden layers.

For the n-th neuron $(l,n)$ in layer $l$, we denote a parameter vector by

$$\eta^l_n = (w^l_n, r^l_n), \quad n = 1, \dots, N_l, \tag{1}$$

where $w^l_n$ is the weight vector of dimension equal to the number of inputs available to the neuron and $r^l_n$ is
the shift scalar. The output of a tanh-type sigmoid neuron $(l,n)$ is given by

$$x^l_n = \tanh\left((w^l_n)^T x^{l-1} + r^l_n\right), \tag{2}$$

where $x^l = (x^l_1, \dots, x^l_{N_l})$ is the output vector of layer $l$. After all hidden layers $l = 2, 3, \dots, L-1$ are evaluated,
the output layer component $y_n$ of the output vector $y$ is typically obtained in one of the following two
ways:

$$y_n = (w^L_n)^T x^{L-1}, \quad n = 1, \dots, q \quad \text{(regression)}, \tag{3}$$

$$y_n = \tanh\left((w^L_n)^T x^{L-1}\right), \quad n = 1, \dots, q \quad \text{(classification)}. \tag{4}$$

We denote the parameter vector of all neurons in a layer $l$ by $\lambda^l$, where

$$\lambda^l = (\eta^l_1, \dots, \eta^l_{N_l}). \tag{5}$$

The vector of all the parameters in the network is given by

$$\theta_a = (\lambda^2, \dots, \lambda^{L-1}, w^L_1, \dots, w^L_q), \tag{6}$$

where $w^L_n = (w^L_{n,1}, \dots, w^L_{n,N_{L-1}})$, $n = 1, \dots, q$, is the vector of the output layer weights for output $y_n$. The
function defined by the network is denoted by

$$y = \Omega(\theta_a; x), \tag{7}$$



where $x$ is the input vector, which is notationally equal to the output of the input layer, so that $x^1 = x$.
Assuming additive normal i.i.d. noise on the available data $(x_k, y_k)$, $k = 1, \dots, K$, the ML-estimate $\hat{\theta}_a$ of the
parameters $\theta_a$ can be obtained as the minimizer of the following least squares optimization problem:

$$\hat{\theta}_a = \arg\min_{\theta_a} \sum_{k=1}^{K} \left(y_k - \Omega(\theta_a; x_k)\right)^T \left(y_k - \Omega(\theta_a; x_k)\right). \tag{8}$$
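As a minimal sketch of Eqns. (2), (3), (4) and (8), the forward evaluation and the least squares objective could be written as follows. The data layout (a list of hidden layers, each a list of `(w, r)` pairs) and the names `forward` and `sse` are assumptions made for this illustration only.

```python
import math

def forward(hidden, out_weights, x, classification=False):
    """Evaluate y = Omega(theta_a; x).

    hidden:      list of hidden layers; each layer is a list of (w, r) pairs,
                 where w is the weight list and r the shift of one neuron (Eqn. 2).
    out_weights: list of weight lists, one per output component y_n.
    """
    for layer in hidden:
        # tanh neuron: x_n^l = tanh(w_n^l . x^{l-1} + r_n^l)
        x = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + r) for w, r in layer]
    # linear output layer (Eqn. 3), optionally squashed by tanh (Eqn. 4)
    y = [sum(wi * xi for wi, xi in zip(w, x)) for w in out_weights]
    return [math.tanh(v) for v in y] if classification else y

def sse(hidden, out_weights, data, classification=False):
    """Sum of squared errors over (x_k, y_k) pairs, the objective of Eqn. (8)."""
    total = 0.0
    for xk, yk in data:
        pred = forward(hidden, out_weights, xk, classification)
        total += sum((a - b) ** 2 for a, b in zip(yk, pred))
    return total
```

A global optimizer would treat `sse` as the error function to be minimized over the entries of `hidden`.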

For regression problems, the output layer is linear as shown in Eqn. (3). Thus, the corresponding weights $w^L_n$
can be determined by a least squares method, as described in [19], which we adopt in this paper. This has the
advantage that global search is applied only to the non-linear part of the parameter space, which generally
speeds up convergence. For classification problems, we assume that an output vector $y$ of a data-sample
designating class $i$ has the following format

Although the output layer is non-linear as shown in Eqn. (4), the corresponding weights $w^L_n$ can still be
determined linearly in the training phase. For this, the output vectors of the training data are rescaled by
factor 20, such that $\tanh(20) \approx 1$ and $\tanh(0) = 0$. The weights of the output layer are determined by a
least squares method using the rescaled data. Given the remaining parameters, Eqn. (8) is applied by using
the non-rescaled data.

Consequently, the parameter vector $\theta$ for the global optimization can be reduced to

$$\theta = (\lambda^2, \dots, \lambda^{L-1}). \tag{10}$$

The important problem of how to choose the net topology is not considered in this paper. For a given net
topology, we focus on the effect of symmetry breaking on the efficiency of the optimization of the parameters
in (10). In the following Section, we investigate the symmetries in the ANN-parameter space.

3 Symmetries in ANNs

A symmetry is an operator $\Phi$ which does not change the output of an ANN when applied to the parameter
vector $\theta$:

$$\Omega(\theta; x) = \Omega(\Phi(\theta); x), \quad \forall \theta, x. \tag{11}$$

Non-reducible ANNs comprise two types of symmetries [30]. The first type is a point symmetry on the
neuron parameter level, since

$$w \tanh(x) = -w \tanh(-x), \quad \forall w, x. \tag{12}$$

The following definition of a point symmetry operator $O^l_n$,

$$O^l_n: \quad \eta^l_n \mapsto -\eta^l_n, \qquad w^{l+1}_{i,n} \mapsto -w^{l+1}_{i,n}, \quad i = 1, \dots, N_{l+1}, \tag{13}$$

changes the sign of the parameters of neuron $(l,n)$ and of the n-th weight component $w^{l+1}_{i,n}$ of all neurons $(l+1,i)$
in the following layer $l+1$. It satisfies the symmetry condition because of Eqn. (12). In Fig. 1, an example
for the application of $O^2_1$ is shown. For each layer $l$, the point symmetry yields $2^{N_l}$ symmetric replicas of the
parameter vector $\theta$.

The second type of symmetry is a permutation symmetry of the neuron parameters $\eta$ and the
corresponding weight parameters in the next layer. A permutation operator $P^l_{m,n}$ defined by

$$P^l_{m,n}: \quad \eta^l_m \leftrightarrow \eta^l_n, \qquad w^{l+1}_{i,m} \leftrightarrow w^{l+1}_{i,n}, \quad i = 1, \dots, N_{l+1}, \tag{14}$$

leaves the output invariant. Note that $P^l_{m,n} = P^l_{n,m}$. In Fig. 2, the application of $P^2_{1,2} = P^2_{2,1}$ is illustrated.
In each layer $l$, there are $N_l!$ symmetric replicas of the parameter vector $\theta$ due to permutation symmetries.
Combining both symmetries, the total count of symmetric replicas per layer $l$ is $2^{N_l} N_l!$. Another important
property is that the length of the vector $\theta$ is invariant under such symmetry operators,

$$\|\Phi(\theta)\| = \|\theta\|, \quad \forall \theta, \tag{15}$$

since the point symmetry operator only changes the sign of some components of the parameter vector, whereas 

the permutation symmetry operator only swaps some components. 
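Both symmetries, together with the length invariance of Eqn. (15), can be checked numerically on a small network. The sketch below assumes a single tanh hidden layer with a linear output layer; the function names and the data layout are illustrative assumptions and not taken from the paper.

```python
import math

def forward(hidden, out_w, x):
    """One tanh hidden layer followed by a linear output layer."""
    h = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + r) for w, r in hidden]
    return [sum(wi * hi for wi, hi in zip(w, h)) for w in out_w]

def point_flip(hidden, out_w, n):
    """Point symmetry: negate neuron n and the n-th weight of every output neuron."""
    hidden = [([-wi for wi in w], -r) if i == n else (w, r)
              for i, (w, r) in enumerate(hidden)]
    out_w = [[-wi if i == n else wi for i, wi in enumerate(w)] for w in out_w]
    return hidden, out_w

def permute(hidden, out_w, m, n):
    """Permutation symmetry: swap neurons m and n and the matching output weights."""
    hidden = list(hidden)
    hidden[m], hidden[n] = hidden[n], hidden[m]
    out_w = [list(w) for w in out_w]
    for w in out_w:
        w[m], w[n] = w[n], w[m]
    return hidden, out_w

def flat(hidden, out_w):
    """Flatten all parameters into one vector."""
    return [v for w, r in hidden for v in (*w, r)] + [v for w in out_w for v in w]

def norm(vec):
    return math.sqrt(sum(v * v for v in vec))
```

Evaluating `forward` before and after `point_flip` or `permute` should give the same output up to rounding, and `norm(flat(...))` should be unchanged.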
Figure 1: Application of the point symmetry operator $O^2_1$, which changes the signs of the $\eta^2_1$-parameters in layer
two and the $w^3_{i,1}$-parameters in layer three, respectively.

Figure 2: Application of the permutation symmetry operator $P^2_{1,2} = P^2_{2,1}$, which exchanges the parameters
$\eta^2_1 \leftrightarrow \eta^2_2$ in layer two and the parameters $w^3_{1,1} \leftrightarrow w^3_{1,2}$, $w^3_{2,1} \leftrightarrow w^3_{2,2}$ in layer three.



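Lemma 3.1. Symmetry operators are linear and orthogonal operators.

Proof. The proof for the linearity of these operators is trivial and therefore omitted in this paper. The
orthogonality follows from Eqn. (15):

$$\|\Phi\theta\| = \|\theta\|, \ \forall\theta \ \Rightarrow\ \|\Phi\theta\|^2 = \|\theta\|^2, \ \forall\theta \tag{16}$$

$$\Rightarrow\ (\Phi\theta)^T(\Phi\theta) = \theta^T\theta, \ \forall\theta \tag{17}$$

$$\Rightarrow\ \Phi^T\Phi = I. \tag{18}$$

$\square$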



Furthermore, applying the same point symmetry operator twice in succession does not change the
parameter vector, since switching the signs of selected components a second time reverts the first sign change.
The same also holds for the permutation symmetry operator: swapping the selected components a second
time reverts the first swap. Therefore, we can write

$$O^l_n O^l_n = I, \qquad P^l_{m,n} P^l_{m,n} = I, \tag{19}$$



where I is the identity operator. As a result, point symmetry, permutation symmetry as well as joint sym- 
metry operators correspond to rotations and all symmetric replicas of a global optimum lie on a hypersphere. 
Since such symmetries multiply the local and global optima count in the parameter space, the ultimate goal 
of symmetry breaking is to reduce the total number of local optima in the parameter space by avoiding all 
but one symmetrically equivalent space partitions. 

There are infinitely many ways for symmetry breaking by using the operators $O^l_n$ and $P^l_{m,n}$, which depend
on the condition upon which these operators are applied. As an example, consider a 2-D point symmetry as
illustrated in Fig. 3. Limiting the search space to the upper half plane ($y > 0$) is one possibility to break
the symmetry, where only one global optimum remains and the space is separated into two partitions. In
this case, the point symmetry operator is to be applied only for $y < 0$. Another possibility is to reduce the
space to the right half plane ($x > 0$). This is realized by applying the point symmetry operator only on the
condition $x < 0$. By rotating the coordinate system, we obtain infinitely many other ways to separate and
reduce the space. As a result, there is a degree of freedom in the choice of a specific condition or separation.
We derive similar results also for the permutation symmetry. In Section 5, we argue that there is an optimal
choice for a specific symmetry breaking condition (separation) based on considerations about the location
of the global optimum. We exploit the degree of freedom in the choice of a specific condition by choosing
a condition such that the distance of the global optimum to the separating region is maximal. In other
words, we demand that the proximity of the global optimum to the separating region is minimal. This way,
the influence of neighboring global optima is minimized and the symmetry breaking can be realized most
effectively.

A detailed discussion about an optimal separation follows in Section 5. 
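The family of half-plane conditions described in this Section can be sketched in a few lines. The function name `fold` and the parametrization of the retained half plane by the angle of its inward normal are assumptions made for this illustration.

```python
import math

def fold(point, angle):
    """Break the 2-D point symmetry (x, y) -> (-x, -y).

    The retained half plane is the one whose inward normal is
    (cos(angle), sin(angle)); points outside it are mapped to their
    symmetric replica.
    """
    x, y = point
    nx, ny = math.cos(angle), math.sin(angle)
    return (x, y) if x * nx + y * ny >= 0 else (-x, -y)
```

With `angle = math.pi / 2` the upper half plane is kept, with `angle = 0.0` the right half plane. Every other angle corresponds to one of the infinitely many rotated separations.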

4 Existing approaches to deal with symmetries 

A commonly used method is to reduce the parameter space to one single symmetrically equivalent region, 
also called partition. To achieve this, the following rules can be applied [31]: 

rule-1 The shift parameter of all neurons is ensured to be positive by flipping the signs of the parameters 
when required, for each neuron. 

rule-2 In each hidden layer, neurons are sorted according to the shift parameter. 

This method and all other similar methods can be realized by applying a chain of the operators $O^l_n$ and
$P^l_{m,n}$. In the following, we show that these rules are suboptimal, and in some cases may even cause inferior
performance. We show that rules for symmetry breaking should take the position of the global optimum into
account in order to be effective. Therefore, we denote rule-1 and rule-2 as global optimum invariant, and
rules which depend on the global optimum as global optimum variant.
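A minimal sketch of rule-1 and rule-2 for one hidden layer is given below. Each neuron is assumed to be stored as a pair `(eta, out)`, where `eta` holds the neuron parameters with the shift as last entry and `out` holds the weights that the next layer attaches to this neuron. This layout, the function names and the ascending sort order are assumptions of the sketch.

```python
def rule1(neurons):
    """rule-1: flip every neuron whose shift (last entry of eta) is negative."""
    return [([-v for v in eta], [-v for v in out]) if eta[-1] < 0 else (eta, out)
            for eta, out in neurons]

def rule2(neurons):
    """rule-2: sort the neurons of one hidden layer by their shift parameter.

    Sorting whole (eta, out) pairs moves the next-layer weights along with
    each neuron, so the network output is unchanged.
    """
    return sorted(neurons, key=lambda neuron: neuron[0][-1])
```

Applying `rule2(rule1(neurons))` maps every parameter vector into the same partition, regardless of where the global optimum lies.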

4.1 Global optimum invariant point symmetry breaking 

Assuming a point symmetric function $f(x,y) = f(-x,-y)$, $\forall x,y$, Fig. 3 shows two cases where rule-1 is
applied such that all y-coordinates are forced to be positive. As a consequence, all solution candidates are
located in the upper half plane and the parameter space is effectively reduced. There is only one remaining
global optimum $\hat{\eta}$. In the left plot, the global optima $\hat{\eta}$ and $-\hat{\eta}$ are relatively far away from the x-axis,
whereas in the right plot, the global optima are close to the x-axis, although they have the same distance
to the origin in both plots. In case of the right plot, there exists an 'artificial' local optimum due to the
proximity of the hidden global optimum $-\hat{\eta}$, to which some solution candidates may be attracted. The main
problem is that after applying symmetry breaking, some solution candidates may still be closer to the hidden
global optimum $-\hat{\eta}$ than to $\hat{\eta}$. As a result, the goal of reducing the influence of other global optima is
not fully achieved. Furthermore, the introduced artificial local optimum may trap some solution candidates
without having a chance to ever reach the corresponding 'hidden' global optimum $-\hat{\eta}$. We believe that this
is the main reason why an inferior performance is reported for some symmetry breaking approaches. Note
that this situation depends on the location of the global optimum, which in turn depends on the problem at
hand. Therefore, this issue arises on some problems, whereas on others, a symmetry breaking with increased
performance can be achieved by these rules. In Fig. 3, the x-axis is the region $S$ of separation

$$S = \{\lambda : (\lambda, 0)\}. \tag{20}$$

The separating region depends on the rule and divides the parameter space into partitions. As an example, an
alternative rule, which would force all x-coordinates to be positive, would have the y-axis as the separating
region. We repeat that the distance of the global optimum to the separating region is crucial for effective
symmetry breaking, and that this distance should be made as large as possible. An
equivalent goal is to apply symmetry breaking such that no solution candidate is closer to the hidden global
optimum than to the global optimum of the selected partition.

4.2 Global optimum invariant permutation symmetry breaking 

Problems similar to those caused by rule-1 also arise from the application of rule-2. This is shown in the following
example. We use a 2x2 parameter structure, i.e., two neurons with two parameters $(a_i, b_i)$ per neuron $i$:
$\theta = (a_1, b_1, a_2, b_2)$. From the permutation symmetry it follows that

$$f((a_1, b_1, a_2, b_2)) = f((a_2, b_2, a_1, b_1)), \quad \forall a_1, b_1, a_2, b_2, \tag{21}$$
Figure 3: Example for a point symmetry in 2-D, where $f(x, y) = f(-x, -y)$ $\forall x, y$. It is assumed that rule-1
is applied to force all solution candidates to be in the upper half plane ($y > 0$). As a result, the parameter
space is effectively reduced and there is only one remaining global optimum $\hat{\eta}$. In the left plot, the global
optima $\hat{\eta}$ and $-\hat{\eta}$ are relatively far away from the x-axis, whereas in the right plot, the global optima are
close to the x-axis, although they have the same distance to the origin in both cases. In case of the right plot,
there exists an 'artificial' local optimum due to the proximity of the hidden global optimum $-\hat{\eta}$, to which some
solution candidates may be attracted. The main problem is that after applying symmetry breaking, some
solution candidates may still be closer to the hidden global optimum $-\hat{\eta}$ than to $\hat{\eta}$.



where $f$ shall be the error function. Let the global optimum be at $\hat{\theta} = (2, 1, -2, 3)$. There are two possibilities
to apply rule-2: sorting by parameter $a$ or sorting by parameter $b$, respectively. The separating region varies
for each choice. Choosing to sort by parameter $a$ yields $S_a$, whereas sorting by parameter $b$ yields $S_b$:

$$S_a = \{\alpha, \beta, \lambda : (\lambda, \alpha, \lambda, \beta)\}, \qquad S_b = \{\alpha, \beta, \lambda : (\alpha, \lambda, \beta, \lambda)\}. \tag{22}$$

We show that each separation region has a different distance to the global optimum $\hat{\theta}$. The closest point
on $S_a$ to $\hat{\theta}$ is at $\lambda = 0$, $\alpha = 1$, $\beta = 3$, which yields the distance $\sqrt{8}$. On the other hand, the closest point
on $S_b$ to $\hat{\theta}$ is at $\lambda = 2$, $\alpha = 2$, $\beta = -2$, which yields the distance $\sqrt{2}$. In this example, applying rule-2 by
ordering the a-coordinates results in a better separation of the partitions. If the global optimum were at
$\hat{\theta} = (1, 2, 3, -2)$, the opposite case would apply. Consequently, similar to rule-1 in the previous Section 4.1,
rule-2 can only be effective on some problems.
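The two distances of this example can be reproduced with a short calculation. The closest point of the separating region replaces the two sort keys by their mean and leaves the other entries untouched, which gives a distance of $|k_1 - k_2| / \sqrt{2}$ for sort keys $k_1, k_2$. The function name below is an assumption of this sketch.

```python
import math

def dist_to_separation(theta, sort_index):
    """Distance of theta = (a1, b1, a2, b2) to the separating region of rule-2.

    sort_index 0 sorts by parameter a (region S_a), 1 sorts by parameter b (S_b).
    """
    k1, k2 = theta[sort_index], theta[2 + sort_index]
    # both keys move to their mean: (k1 - m)^2 + (k2 - m)^2 = (k1 - k2)^2 / 2
    return math.sqrt((k1 - k2) ** 2 / 2.0)
```

For the optimum (2, 1, -2, 3) this gives $\sqrt{8}$ for $S_a$ and $\sqrt{2}$ for $S_b$; for (1, 2, 3, -2) the two values are exchanged.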

5 Minimum global optimum proximity principle 

In this Section we propose new methods for symmetry breaking to avoid the problems described in Section 4. 
Here, we assume that the basin, or the region of influence of the global optimum is isotropic. Although this 
assumption does not apply in general, it is introduced to simplify the discussion. Also, this simplification 
enables us to easily derive theoretically motivated methods, which prove to be very effective in a wide range 
of problems. In the presentation, we first consider the point symmetry, then the permutation symmetry and 
finally the general joint symmetry as a combination of both point and permutation symmetries. 



5.1 Minimum global optimum proximity principle for point symmetry 

The differences between possible rules to apply the point symmetry operator arise from the condition on
which the operator is to be applied. Fig. 4 shows different rules with corresponding separation regions for
breaking a point symmetry in relation to the global optimum. It can be seen that the separating region
with the maximum distance to the global optima, i.e., with minimal proximity,
enables the optimal separation or partitioning. This way, an optimal isolation between all symmetric replicas
of the global optimum is achieved. As a result, the disturbing influence of other neighboring global optima
is decreased to a minimum, which in turn effectively maximizes the attraction of the global optimum of the
selected partition.

The following Lemma provides a more general perspective for rule-1 presented in Section 4. Note that 
the shift parameter is the last entry in the parameter vector. 
Figure 4: Example for a point symmetry in 2-D, where $f(\eta) = f(-\eta)$ $\forall \eta$. The plots show worst case (left),
suboptimal (middle) and optimal separation lines (right) for point symmetry breaking. The separating line
divides the parameter space in two parts, where each partition contains a global optimum ($\hat{\eta}$ and $-\hat{\eta}$).



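Lemma 5.1. Rule-1 from Section 4 modifies a parameter vector $\eta$ as:

$$\eta' = \begin{cases} \eta & \text{if } \|\eta - (0, \dots, 0, 1)\|^2 < \|-\eta - (0, \dots, 0, 1)\|^2 \\ -\eta & \text{if } \|\eta - (0, \dots, 0, 1)\|^2 > \|-\eta - (0, \dots, 0, 1)\|^2 \end{cases} \tag{23}$$

Proof. From the first line in Eqn. (23) it follows with $\eta = (x_1, \dots, x_D)$ and a reference vector $r = (0, \dots, 0, 1)$ that

$$\|\eta - r\|^2 < \|-\eta - r\|^2 \tag{24}$$

$$\Leftrightarrow\ \|\eta - (0, \dots, 0, 1)\|^2 < \|-\eta - (0, \dots, 0, 1)\|^2 \tag{25}$$

$$\Leftrightarrow\ x_1^2 + \dots + (x_D - 1)^2 < x_1^2 + \dots + (-x_D - 1)^2. \tag{26}$$

Further simplifying both sides of the inequality yields

$$-x_D < x_D \ \Leftrightarrow\ x_D > 0. \tag{27}$$

This means that the conditional Equation (23) is equivalent to rule-1, which demands that the shift parameters
shall be positive. $\square$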

The rule-structure introduced by Lemma 5.1 can be used to formulate the following strategy to maximize
the distance of the global optimum $\hat{\eta}$ to the separating region:

$$\eta' = \begin{cases} \eta & \text{for } \|\eta - \hat{\eta}\|^2 < \|-\eta - \hat{\eta}\|^2 \\ -\eta & \text{otherwise} \end{cases} \tag{28}$$

Theorem 5.2. The solution candidate $\eta'$ determined by rule (28) is always closer to $\hat{\eta}$ than to $-\hat{\eta}$.

We will prove Theorem 5.2 in a more general setting in Section 5.3.
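Rule (28) translates directly into code. The following sketch assumes that parameter vectors are plain lists of floats; the function name is hypothetical.

```python
def break_point_symmetry(eta, eta_hat):
    """Rule (28): keep eta if it is strictly closer to eta_hat than -eta is,
    otherwise return the symmetric replica -eta."""
    d_keep = sum((a - b) ** 2 for a, b in zip(eta, eta_hat))
    d_flip = sum((-a - b) ** 2 for a, b in zip(eta, eta_hat))
    return list(eta) if d_keep < d_flip else [-a for a in eta]
```

Expanding both squared norms shows that the condition is equivalent to a positive inner product of `eta` and `eta_hat`, so the separating region is the hyperplane through the origin orthogonal to the global optimum. In the test case `eta = [-1.5, 0.3]` with `eta_hat = [2.0, -0.2]`, rule-1 would leave the vector unchanged because its last entry is positive, whereas rule (28) maps it to `[1.5, -0.3]`.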

5.2 Minimum global optimum proximity principle for permutation symmetry 

In this Section we introduce an optimal rule for breaking a permutation symmetry for parameter spaces with
two blocks of permutation-invariant parameters. We define a parameter vector $\theta$ as

$$\theta = (\eta_1, \eta_2) = (\eta_1 | \eta_2), \tag{29}$$

where the notation $(\eta_1 | \eta_2)$ is used to emphasize the block structure. The permutation symmetry is given by

$$f(\theta) = f((\eta_1, \eta_2)) = f(P\theta) = f((\eta_2, \eta_1)), \quad \forall \theta, \tag{30}$$

where $f$ is the error function and $P$ is a permutation operator defined by

$$P(\eta_1, \eta_2) = (\eta_2, \eta_1), \quad \forall \eta_1, \eta_2. \tag{31}$$

The following Lemma restates rule-2 as a distance dependent rule.



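Lemma 5.3. Assuming the shift parameter is the last parameter in the parameter block $\eta$, rule-2, presented
in Section 4, can alternatively be described in a more general form by the following rule:

$$\theta' = \begin{cases} (\eta_1 | \eta_2) & \text{for } \|(\eta_1 | \eta_2) - (0, \dots, 0 | 0, \dots, 1)\|^2 < \|(\eta_2 | \eta_1) - (0, \dots, 0 | 0, \dots, 1)\|^2 \\ (\eta_2 | \eta_1) & \text{otherwise} \end{cases} \tag{32}$$

Proof. From Eqn. (32) it follows with $\eta_i = (x_{i,1}, \dots, x_{i,D})$ that

$$\|(x_{1,1}, \dots, x_{1,D} | x_{2,1}, \dots, x_{2,D}) - (0, \dots, 0 | 0, \dots, 1)\|^2 < \|(x_{2,1}, \dots, x_{2,D} | x_{1,1}, \dots, x_{1,D}) - (0, \dots, 0 | 0, \dots, 1)\|^2 \tag{33}$$

$$\Leftrightarrow\ x_{1,D}^2 + (x_{2,D} - 1)^2 < (x_{1,D} - 1)^2 + x_{2,D}^2 \tag{34}$$

$$\Leftrightarrow\ x_{2,D} > x_{1,D}. \tag{35}$$

Hence the blocks are kept in their order exactly when they are already sorted by the shift parameter, which is rule-2. $\square$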

We state the following proposal in order to maximize the distance of the global optimum $\hat{\theta}$ to the separating
region, according to the rule-structure introduced by Lemma 5.3:

$$\theta' = \begin{cases} (\eta_1 | \eta_2) & \text{for } \|(\eta_1 | \eta_2) - \hat{\theta}\|^2 < \|(\eta_2 | \eta_1) - \hat{\theta}\|^2 \\ (\eta_2 | \eta_1) & \text{otherwise} \end{cases} \tag{36}$$

Theorem 5.4. The solution candidate $\theta'$ determined by rule (36) is always closer to $\hat{\theta}$ than to $P\hat{\theta}$.
Theorem 5.4 will be proved in a more general setting in Section 5.3.
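Rule (36) can be sketched analogously for two parameter blocks. The function name and the representation of $\hat{\theta}$ as a pair of blocks are assumptions made for this illustration.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def break_permutation_symmetry(eta1, eta2, theta_hat):
    """Rule (36): keep (eta1 | eta2) if it is strictly closer to theta_hat
    than the permuted vector (eta2 | eta1), otherwise swap the blocks."""
    hat1, hat2 = theta_hat
    d_keep = sqdist(eta1, hat1) + sqdist(eta2, hat2)
    d_swap = sqdist(eta2, hat1) + sqdist(eta1, hat2)
    return (eta1, eta2) if d_keep < d_swap else (eta2, eta1)
```

With the global optimum (2, 1 | -2, 3) from Section 4.2, the candidate (-1.8, 2.5 | 1.9, 1.2) is swapped to (1.9, 1.2 | -1.8, 2.5), which lies in the partition of the selected optimum.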

5.3 Ideal symmetry breaking 

For a given ANN-optimization problem, let $\mathcal{P}$ be the set of all possible symmetry operators. Note that a
symmetry operator $\Phi \in \mathcal{P}$ may be a point symmetry, a permutation symmetry or a joint symmetry operator.
A joint symmetry operator is generally composed of a chain of point symmetry and permutation symmetry
operators. As an example, $\Phi = O^l_n \circ P^l_{m,n}$ applies a permutation symmetry followed by a point symmetry
operator. The following properties of symmetry operators are relevant in the following discussion. According
to Eqn. (11), a symmetry operator does not change the output of the ANN when applied to the parameter
vector $\theta$. According to Eqn. (15), a symmetry operator does not change the length of a parameter vector.
Furthermore, according to Eqn. (18), symmetry operators are orthogonal.

Given a parameter vector $\theta$, the set $\mathcal{R}_\theta$ of all symmetric replicas of $\theta$ is defined by

$$\mathcal{R}_\theta = \{\Phi \in \mathcal{P} : \Phi\theta\} = \mathcal{P}\theta. \tag{37}$$

Recall that the ultimate goal of symmetry breaking is to minimize the influence of all symmetric replicas of
the selected global optimum and to concentrate the global search on the partition where the selected global
optimum is located. To achieve this, we propose the following joint separation condition:

$$\theta' = \arg\min_{\tilde{\theta} \in \mathcal{R}_\theta} \|\tilde{\theta} - \hat{\theta}\|^2. \tag{38}$$

In other words, this optimization selects the symmetric replica of $\theta$ that is closest to the selected global optimum $\hat{\theta}$.
Finding the closest symmetric replica of $\theta$ means finding the corresponding symmetry operator $\Phi'$, where

$$\theta' = \Phi'\theta. \tag{39}$$

In case the parameter vector $\theta$ is already close to $\hat{\theta}$, i.e., it is in the corresponding partition, the solution for
$\Phi'$ is the identity operator $I$. Note that, according to Eqn. (19), the identity operator $I$ is in $\mathcal{P}$. In Fig. 5,
ideal symmetry breaking according to Eqn. (38) is illustrated on a hypothetical 2-D space.

Theorem 5.5. The solution $\theta'$ determined by Equation (38) ensures that no other symmetric replica of the
selected global optimum is closer to $\theta'$ than $\hat{\theta}$. In other words, it minimizes the influence of the symmetric
replicas of the selected global optimum.



Figure 5: Ideal symmetry breaking according to Eqn. (38) shown on a hypothetical 2-D space. In this
example, applying a point symmetry operator followed by a permutation symmetry operator maps $\theta$ to $\theta'$,
which is located in the partition of the selected global optimum $\hat{\theta}$, marked by a star. Note that a symmetry
operator corresponds to a rotation, which preserves lengths as well as angles.

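Proof. We prove this by contradiction. According to Eqn. (38), $\|\theta' - \hat{\theta}\|^2$ is minimal. Assume that there
exists a global optimum replica $\hat{\theta}' \neq \hat{\theta}$ with

$$\|\theta' - \hat{\theta}'\|^2 < \|\theta' - \hat{\theta}\|^2. \tag{40}$$

Due to the underlying symmetry properties, each global optimum replica can be mapped to another replica
by a symmetry operator, i.e., there exists a symmetry operator $\Phi \in \mathcal{P} \setminus \{I\}$ which satisfies

$$\hat{\theta}' = \Phi\hat{\theta} \ \Leftrightarrow\ \hat{\theta} = \Phi^{-1}\hat{\theta}'. \tag{41}$$

Due to the length-preserving property of symmetry operators, using Eqn. (41), the left-hand side of
Relation (40) can be written as

$$\|\theta' - \hat{\theta}'\|^2 = \|\Phi^{-1}(\theta' - \hat{\theta}')\|^2 = \|\Phi^{-1}\theta' - \Phi^{-1}\hat{\theta}'\|^2 = \|\Phi^{-1}\theta' - \hat{\theta}\|^2. \tag{42}$$

Since $\Phi \neq I$ and therefore $\Phi^{-1} \neq I$, it follows that $\Phi^{-1}\theta' \neq \theta'$. Since $\Phi^{-1}\theta'$ is itself a symmetric replica of $\theta$
and, by (40) and (42), closer to $\hat{\theta}$ than $\theta'$, this means that $\theta'$ does not minimize
the distance to $\hat{\theta}$, which contradicts Eqn. (38). $\square$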
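For very small layers, Eqn. (38) can be solved by enumerating all symmetric replicas. The sketch below assumes a single hidden layer whose parameter vector is a list of neuron blocks, so that the replicas are all sign patterns combined with all block orders. The function names are hypothetical.

```python
import itertools

def replicas(blocks):
    """Yield all 2^N * N! symmetric replicas of a list of N neuron blocks."""
    for perm in itertools.permutations(blocks):
        for signs in itertools.product((1, -1), repeat=len(blocks)):
            yield [[s * v for v in b] for s, b in zip(signs, perm)]

def sqdist(a, b):
    """Squared distance between two block lists of identical shape."""
    return sum((x - y) ** 2 for ba, bb in zip(a, b) for x, y in zip(ba, bb))

def ideal_break(blocks, hat):
    """Eqn. (38): the replica of `blocks` that is closest to the optimum `hat`."""
    return min(replicas(blocks), key=lambda rep: sqdist(rep, hat))
```

The statement of Theorem 5.5 can then be checked by comparing the distance from the result to `hat` with the distances to every element of `replicas(hat)`. The factorial and exponential growth of the enumeration is what motivates a cheaper approximation.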

5.4 Approximations of the ideal separation 

In order to take advantage of these results, we have to address two issues. First, the global optimum is 
not known a priori. Second, the brute force method for finding an optimal solution to (38) has exponential 
complexity, but a low-complexity algorithm is desired. In order to circumvent the first problem, we propose 
to use an estimate for the global optimum, which can be determined by the population of solution candidates 
at each iteration of the applied Monte Carlo method. Naturally, this estimate improves with increasing 
iteration number. The second problem can be addressed by using an approximation for the ideal separation 
achieved by (38). 

To describe the proposed method, for each neuron $(l, n)$, we define a symmetry relevant parameter block
$\beta^l_n$ as

$$\beta^l_n = (\eta^l_n, w^{l+1}_{1,n}, \dots, w^{l+1}_{N_{l+1},n}), \quad l = 2, \dots, L-2, \tag{43}$$

$$\beta^{L-1}_n = \eta^{L-1}_n, \tag{44}$$

which also includes some corresponding parameters from the next layer $l + 1$. Given a parameter vector $\theta$
and an estimate $\hat{\theta}$ of the global optimum with corresponding parameter blocks $\beta^l_n$ and $\hat{\beta}^l_n$, the pseudocode in Algorithm 1
describes the proposed approximation for ideal symmetry breaking. In Fig. 6, the effect of several
symmetry breaking approaches is demonstrated on a hypothetical 2-D parameter space.
Algorithm 1 Proposed symmetry breaking method. A symmetry operator $\Phi$ is only applied to the parameter
vector $\theta$ when it decreases the distance to the global optimum $\hat{\theta}$, i.e., $\|\Phi\theta - \hat{\theta}\| < \|\theta - \hat{\theta}\|$. Algorithm input:
$\theta$ and $\hat{\theta}$. Effect: modification of the parameter vector $\theta$ when appropriate.

[breaking point symmetry]
for all hidden layers $l = 2, \dots, L-1$ do
    for all neurons $(l,n)$, $n = 1, \dots, N_l$ per layer $l$ do
        // would the point symmetry operator $O^l_n$ decrease the distance? ($\|O^l_n\theta - \hat{\theta}\| < \|\theta - \hat{\theta}\|$)
        calculate distance-square for NOT applying $O^l_n$: $D_1 = \|\beta^l_n - \hat{\beta}^l_n\|^2$
        calculate distance-square for applying $O^l_n$: $D_2 = \|-\beta^l_n - \hat{\beta}^l_n\|^2$
        if $D_1 > D_2$ then
            apply point symmetry operator $O^l_n$: set $\beta^l_n = -\beta^l_n$
        end if
    end for
end for

[breaking permutation symmetry]
for all hidden layers $l = 2, \dots, L-1$ do
    randomly choose two neurons $(l,m)$, $(l,n)$, $m, n \in \{1, \dots, N_l\}$ in hidden layer $l$ with $m \neq n$
    // would the permutation operator $P^l_{m,n}$ decrease the distance? ($\|P^l_{m,n}\theta - \hat{\theta}\| < \|\theta - \hat{\theta}\|$)
    calculate distance-square for NOT applying $P^l_{m,n}$: $D_1 = \|\beta^l_n - \hat{\beta}^l_n\|^2 + \|\beta^l_m - \hat{\beta}^l_m\|^2$
    calculate distance-square for applying $P^l_{m,n}$: $D_2 = \|\beta^l_n - \hat{\beta}^l_m\|^2 + \|\beta^l_m - \hat{\beta}^l_n\|^2$
    if $D_1 > D_2$ then
        apply permutation symmetry operator $P^l_{m,n}$: swap $\beta^l_m \leftrightarrow \beta^l_n$
    end if
end for
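One pass of Algorithm 1 can be sketched as follows. The sketch is restricted to a single hidden layer, so that the blocks of Eqn. (44) do not overlap. The function name, the returned `modified` flag and the optional `rng` argument are assumptions of this illustration.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def break_symmetry(blocks, hat, rng=random):
    """One pass of Algorithm 1 over the blocks of a single hidden layer.

    blocks, hat: lists of neuron blocks of the candidate and of the estimate
    of the global optimum. Returns (new_blocks, modified).
    """
    blocks = [list(b) for b in blocks]
    modified = False
    # [breaking point symmetry]: flip a block if that reduces its distance
    for n, h in enumerate(hat):
        flipped = [-v for v in blocks[n]]
        if sqdist(blocks[n], h) > sqdist(flipped, h):
            blocks[n] = flipped
            modified = True
    # [breaking permutation symmetry]: test one randomly chosen pair
    if len(blocks) >= 2:
        m, n = rng.sample(range(len(blocks)), 2)
        d1 = sqdist(blocks[n], hat[n]) + sqdist(blocks[m], hat[m])
        d2 = sqdist(blocks[n], hat[m]) + sqdist(blocks[m], hat[n])
        if d1 > d2:
            blocks[m], blocks[n] = blocks[n], blocks[m]
            modified = True
    return blocks, modified
```

Each pass costs a number of distance evaluations that is linear in the number of neurons, in contrast to the exhaustive search of Eqn. (38). Since only one neuron pair is tested per pass and flips are decided block by block, a single pass does not necessarily reach the closest replica.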



5.4.1 DE with symmetry breaking 

The DE method [29, 22] comprises a population of solution candidates $\theta_i$, which are iteratively updated and
moved towards an optimal solution. We propose to choose the centroid of the population at each iteration
as an estimate for the global optimum $\hat{\theta}$.

The DE method extended by the global optimum invariant symmetry breaking [31] is denoted by DE-INV-SB,
DE extended by the proposed global optimum variant symmetry breaking, described by Algorithm 1, is
denoted by DE-SB and DE with global optimum variant ideal symmetry breaking using brute force search is
denoted by DE-SB-BF. As shown in Fig. 7, in DE-based symmetry breaking approaches, symmetry breaking
is always applied to each solution candidate $\theta_i$ right after it has been updated for the next iteration. Only
in DE-SB, we apply an additional step by increasing the error yield of some solution candidates which
are not in the same partition as the selected partition holding $\hat{\theta}$. This increases the probability that these
solution candidates are updated and moved closer to the selected partition. This is not required for symmetry
breaking approaches which map each solution candidate exactly to the selected partition, such as DE-INV-SB
or DE-SB-BF. The DE-SB method is described in Algorithm 2.

Algorithm 2 DE-SB. Algorithm input: population of candidate vectors $\theta_j$, $j = 1, \ldots, N_p$, and the centroid of
the population as the estimate for the global optimum $\hat{\theta}$. Effect: modify candidate vectors $\theta_j$, $j = 1, \ldots, N_p$,
when appropriate.

for all candidate vectors $\theta_j$, $j = 1, \ldots, N_p$ do
    apply symmetry breaking on $\theta_j$, see Algorithm 1
    if $\theta_j$ modified (a symmetry operator was applied) and $j < N_p/2$ then
        multiply the stored error yield of $\theta_j$ by factor 100
    end if
end for
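As an illustration of Algorithm 2, the following Python sketch applies the error inflation step to a population. It is a sketch under assumptions: `flip_toward` is a hypothetical stand-in for Algorithm 1 that treats the whole candidate as a single neuron, and the function names are not taken from the paper or from MCMLL.

```python
def centroid(population):
    """Component-wise mean of the population, used as global optimum estimate."""
    n = len(population)
    return [sum(column) / n for column in zip(*population)]


def flip_toward(theta, theta_hat):
    """Hypothetical stand-in for Algorithm 1: negate theta in place if that
    brings it closer to theta_hat. Returns True if theta was modified."""
    d_keep = sum((h - t) ** 2 for h, t in zip(theta_hat, theta))
    d_flip = sum((h + t) ** 2 for h, t in zip(theta_hat, theta))
    if d_keep > d_flip:
        theta[:] = [-t for t in theta]
        return True
    return False


def de_sb_step(population, errors, break_symmetry=flip_toward, factor=100.0):
    """Apply symmetry breaking to every candidate (Algorithm 2).

    The stored error of a modified candidate with index j < Np/2 (1-based)
    is multiplied by `factor`, which makes its replacement more likely.
    """
    theta_hat = centroid(population)  # estimate is fixed before the loop
    n_p = len(population)
    for j, theta in enumerate(population, start=1):
        if break_symmetry(theta, theta_hat) and j < n_p / 2:
            errors[j - 1] *= factor
    return theta_hat
```

Passing the symmetry breaking routine as a parameter keeps the sketch independent of how a network's parameters are split into neurons.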



5.4.2 CMA-ES with symmetry breaking 

The CMA-ES method [12, 11] adapts a global step size $\sigma$, the mean $m$ and a covariance matrix $C$ at each
iteration. $N_p$ solution candidate vectors are drawn from the Gaussian distribution $\mathcal{N}(m, \sigma C)$ with
mean $m$ and covariance matrix $\sigma C$. After sorting the population by the error each candidate vector yields,
the best $N_p/2$ samples are used to update the mean, the covariance matrix and the step size for the next iteration.
Figure 6: Examples of symmetry breaking methods. Given a distribution of solution candidates as shown in
the upper circle, typical outcomes of three different symmetry breaking methods are shown. In the left-bottom
case, all solution candidates are mapped into the selected partition, but the global optimum is not necessarily
centered within the partition. As a downside, there is a relatively strong influence of the global optimum from
the neighbor partition. In the center-bottom case, the selected partition is chosen such that the distances to
the other symmetric replicas of the global optimum are maximized, and all solution candidates are mapped into the
selected partition. The right-bottom case shows the proposed approximate global optimum variant symmetry
breaking. It equals the center-bottom case, except that the solution candidates are not necessarily mapped into
the selected partition, but possibly into other partitions close to the selected one.

Figure 7: Flowgraph for DE with symmetry breaking.



In the following discussion, the CMA-ES method extended by the global optimum invariant symmetry
breaking [31] is denoted by CMA-ES-INV-SB. CMA-ES extended by the proposed global optimum variant
symmetry breaking, described by Algorithm 1, is denoted by CMA-ES-SB, and CMA-ES with global optimum
variant ideal symmetry breaking using brute force search is denoted by CMA-ES-SB-BF.

In CMA-ES-INV-SB, CMA-ES-SB and CMA-ES-SB-BF, symmetry breaking is applied right after the
evaluation of all candidate vectors and prior to updating the parameters of the Gaussian distribution. In
CMA-ES-SB, we propose to use the best candidate vector so far (the one yielding the smallest error) as the estimate
for the global optimum, denoted by $\hat{\theta}$. Fig. 8 shows the flowgraph for the CMA-ES-based symmetry breaking
approaches. For CMA-ES-SB, the update of the mean is described in Algorithm 3. In all other
CMA-ES-based methods, the original update formula for the mean is applied.

In CMA-ES, applying symmetry breaking introduces a bias in the mean, which can lead to an excessive
increase of the global step size and negatively affect the performance. This bias results from the rotations
caused by the symmetry operators. These rotations move solution candidates to the vicinity of one partition,
which typically increases the radius of the population mean, as shown in Fig. 6. In order to prevent such an
increase, in all CMA-ES-based symmetry breaking methods, we modify the damping term for the update of
the global step size $\sigma$. Let $s$ be the shift vector of the centroid of the best $N_p/2$ solution candidates induced
by applying symmetry breaking. The regular update formula for $\sigma$,

$$\sigma_{k+1} = \sigma_k \exp(x), \tag{45}$$

is changed to

$$\sigma_{k+1} = \sigma_k \exp\left[x \exp\left(-0.05\, D^2 \|s\|\right)\right], \tag{46}$$

where $k$ is the iteration number and $x$ is a term depending on the difference of the previous mean and the
current mean, and several other parameters.
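The effect of the modified damping can be checked numerically with a short Python sketch. It assumes that the exponent in Eqn. (46) reads $0.05\,D^2\|s\|$ with $D$ the dimension of the parameter space; the function name and its arguments are illustrative only, and the term $x$ is passed in as a plain number rather than computed from the CMA-ES state.

```python
import math


def damped_sigma_update(sigma_k, x, shift, dim):
    """Step-size update with symmetry-breaking damping, after Eqn. (46).

    sigma_k -- current global step size
    x       -- regular exponent term of the CMA-ES step-size update
    shift   -- shift vector s of the centroid caused by symmetry breaking
    dim     -- dimension D of the parameter space (assumed meaning of D)
    """
    norm_s = math.sqrt(sum(c * c for c in shift))
    damping = math.exp(-0.05 * dim ** 2 * norm_s)  # equals 1 when s = 0
    return sigma_k * math.exp(x * damping)
```

With a zero shift the regular update (45) is recovered; the larger the shift, the closer the factor gets to one, so the step size changes less.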



Algorithm 3 CMA-ES-SB. Algorithm input: population of candidate vectors $\theta_j$, $j = 1, \ldots, N_p$, the estimate
for the global optimum $\hat{\theta}$ and weights $w_j$, $j = 1, \ldots, N_p$. Effect: modify candidate vectors $\theta_j$, $j = 1, \ldots, N_p$,
when appropriate.

set mean vector $m := 0$
for all candidate vectors $\theta_j$, $j = 1, \ldots, N_p$ do
    apply symmetry breaking on $\theta_j$, see Algorithm 1
    if $\theta_j$ modified (a symmetry operator was applied) then
        add weighted global optimum estimate to mean vector: $m := m + w_j \hat{\theta}$
    else
        add weighted candidate vector to mean vector: $m := m + w_j \theta_j$
    end if
end for

Figure 8: Flowgraph for CMA-ES with symmetry breaking.
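The mean update of Algorithm 3 can be sketched as follows. The sketch assumes that the mean is initialized to the zero vector before the loop and uses a hypothetical whole-vector sign flip, `flip_toward`, in place of Algorithm 1; none of the names are taken from the paper or from MCMLL.

```python
def flip_toward(theta, theta_hat):
    """Hypothetical stand-in for Algorithm 1: negate theta in place if that
    brings it closer to theta_hat. Returns True if theta was modified."""
    d_keep = sum((h - t) ** 2 for h, t in zip(theta_hat, theta))
    d_flip = sum((h + t) ** 2 for h, t in zip(theta_hat, theta))
    if d_keep > d_flip:
        theta[:] = [-t for t in theta]
        return True
    return False


def cma_es_sb_mean(candidates, weights, theta_hat, break_symmetry=flip_toward):
    """Weighted mean in which every modified candidate is replaced by the
    global optimum estimate theta_hat (Algorithm 3)."""
    mean = [0.0] * len(theta_hat)
    for theta, w in zip(candidates, weights):
        modified = break_symmetry(theta, theta_hat)
        source = theta_hat if modified else theta
        mean = [m + w * s for m, s in zip(mean, source)]
    return mean
```

Candidates that had to be moved by a symmetry operator thus pull the mean towards the estimate instead of towards their own position.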



6 Experiments 



In this section, we present experimental results that demonstrate the performance improvements achieved by
symmetry breaking. The following methods are compared using regression and classification tests. From the
DE-family: Differential Evolution (DE), DE with global optimum invariant symmetry breaking (DE-INV-SB),
DE with global optimum variant symmetry breaking (DE-SB) and DE with global optimum variant
ideal symmetry breaking using brute force search (DE-SB-BF). From the CMA-ES-family: Covariance Matrix
Adaptation Evolution Strategies (CMA-ES), CMA-ES with global optimum invariant symmetry breaking
(CMA-ES-INV-SB), CMA-ES with global optimum variant symmetry breaking (CMA-ES-SB) and CMA-ES
with global optimum variant ideal symmetry breaking using brute force search (CMA-ES-SB-BF). It should
be noted that the purpose of this investigation is not to present the best global optimization method for
ANN-learning, but to demonstrate the benefits of symmetry breaking.

Table 1: Normalized training set errors for the regression and the autoencoding problems, and normalized
test set errors for the classification problems. The best results are printed in boldface. For each problem
and method, errors are normalized by the maximum error from within the corresponding regular method, its
extension by global optimum invariant symmetry breaking and its extension by global optimum variant
symmetry breaking.

Problem          DE             DE-INV-SB      DE-SB          CMA-ES         CMA-ES-INV-SB  CMA-ES-SB
syn5             0.958 ± 0.079  1.000 ± 0.186  0.949 ± 0.039  1.000 ± 4.446  0.3S6 ± 0.469  0.093 ± 0.007
sinc             1.000 ± 0.859  0.412 ± 0.166  0.114 ± 0.008  0.459 ± 0.271  1.000 ± 0.784  0.139 ± 0.051
inc-sinc         1.000 ± 0.963  0.337 ± 0.155  0.089 ± 0.016  0.287 ± 0.336  1.000 ± 0.707  0.082 ± 0.035
sinc2d           1.000 ± 0.387  0.995 ± 0.094  0.875 ± 0.029  0.975 ± 0.139  1.000 ± 0.241  0.089 ± 0.253
sinc3d           0.622 ± 0.029  1.000 ± 0.572  0.603 ± 0.033  1.000 ± 1.401  0.090 ± 0.013  0.043 ± 0.021
autoenc-circle   0.057 ± 0.082  1.000 ± 1.850  0.020 ± 0.030  1.000 ± 0.295  0.626 ± 0.548  0.077 ± 0.164
autoenc-spiral   0.341 ± 0.545  1.000 ± 0.932  0.116 ± 0.308  0.248 ± 0.232  1.000 ± 0.882  0.030 ± 0.024
autoenc-sphere   0.554 ± 0.321  1.000 ± 0.064  0.022 ± 0.012  0.050 ± 0.012  1.000 ± 0.416  0.032 ± 0.008
two-circles      0.450 ± 0.225  1.000 ± 0.182  0.269 ± 0.074  0.635 ± 0.368  1.000 ± 0.284  0.326 ± 0.169
two-spirals      0.918 ± 0.260  1.000 ± 0.213  0.426 ± 0.197  1.000 ± 0.228  0.930 ± 0.201  0.683 ± 0.293
digits           0.325 ± 0.087  1.000 ± 0.111  0.272 ± 0.062  1.000 ± 0.352  0.805 ± 0.113  0.668 ± 0.099
With a $D$-dimensional parameter space, all tests are performed with the following settings:

• DE, DE-SB, DE-INV-SB and DE-SB-BF settings: $F = 0.5$, $C_r = 0.9$; the initial population is randomly
generated (uniformly) in the $D$-dimensional hypercube $[-1, 1]^D$.

• CMA-ES, CMA-ES-SB, CMA-ES-INV-SB and CMA-ES-SB-BF settings: we used the settings suggested
for enhanced global search abilities in the C-code reference implementation.

• In all experiments, the optimization is finished when a maximum number of ANN-function-evaluations
is reached.

Given a parameter $\theta$ and a data set $(x_i, y_i)$, we define the Mean Squared Error (MSE) $e$ according to Eqn. (8):

$$e = \frac{1}{K \cdot q} \sum_{k=1}^{K} \left(y_k - n(\theta; x_k)\right)^T \left(y_k - n(\theta; x_k)\right). \tag{47}$$
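A minimal Python sketch of the error measure (47) is given below. It assumes that $K$ is the number of samples and $q$ the output dimension, which is one plausible reading of the normalization; `net` is a hypothetical callable standing for the network function $n(\theta; x)$.

```python
def mean_squared_error(net, theta, xs, ys):
    """MSE of net(theta, x) over the data set, after Eqn. (47).

    net -- callable returning the list of network outputs for input x
    xs  -- list of K inputs
    ys  -- list of K target vectors, each of length q
    """
    k_samples = len(xs)
    q_outputs = len(ys[0])
    total = 0.0
    for x, y in zip(xs, ys):
        out = net(theta, x)
        # squared residual norm (y - n)^T (y - n) for one sample
        total += sum((yi - oi) ** 2 for yi, oi in zip(y, out))
    return total / (k_samples * q_outputs)
```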



In order to limit the $D$-dimensional parameter space to a feasible region, we apply a penalty approach.
Due to the length-invariance of the symmetry operators as shown in Eqn. (15), the feasible region is defined
by a hypersphere. In case of $\|\theta\| > \sqrt{D}$, the error function (47) is evaluated at the rescaled parameter vector
$\sqrt{D}\,\theta/\|\theta\|$ and a penalty term $50(\|\theta\| - \sqrt{D})$ is added to the error $e$.
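The penalty approach can be sketched in Python as a wrapper around an arbitrary error function. The sketch assumes that the rescaled vector is the projection of $\theta$ onto the hypersphere of radius $\sqrt{D}$; `penalized_error` and `error_fn` are hypothetical names.

```python
import math


def penalized_error(error_fn, theta, penalty_weight=50.0):
    """Evaluate error_fn inside the feasible hypersphere of radius sqrt(D).

    Outside the sphere, theta is rescaled onto its surface (assumed form of
    the rescaling) and a penalty proportional to the overshoot is added.
    """
    radius = math.sqrt(len(theta))
    norm = math.sqrt(sum(t * t for t in theta))
    if norm <= radius:
        return error_fn(theta)
    rescaled = [t * radius / norm for t in theta]
    return error_fn(rescaled) + penalty_weight * (norm - radius)
```

Because the symmetry operators preserve the length of the parameter vector, a candidate stays feasible (or infeasible by the same amount) when symmetry breaking is applied.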

In self-generated data sets, we add normally distributed noise with zero mean and variance $\sigma^2$ to the
function values $y_i$:

$$y_i = f(x_i) + \Delta_i, \quad \Delta_i \sim \mathcal{N}(0, \sigma^2), \quad \sigma = 5 \times 10^{-3}. \tag{48}$$



6.1 Experimental setup 

In all experiments, the data is normalized to zero mean and unit variance. The population size $N_p$
used in DE and CMA-ES depends on the problem and on the choice of the optimization method and is therefore
adapted manually. For each problem and each optimization method, we conduct 50 independent
repetitions of the optimization process and record the error over the number of ANN-evaluations. To test
for statistical significance of the obtained results, the Kruskal-Wallis test [14] is first applied to the hypothesis
that all performance means are equal. In case this hypothesis is rejected, the Wilcoxon rank sum test [35]
is applied to all pairs of means to identify significantly different results. All tests are based on a significance
level of 0.05. Table 1 shows the normalized training set errors for the regression and the autoencoding problems,
and the normalized test set errors for the classification problems.
6.2 Regression problems

As in [4], we apply learning only on a training set to compare the performance of the introduced methods.
In the following, the regression problems are introduced and the corresponding results are shown.

6.2.1 Dataset syn5 

The syn5 dataset is generated by the fourth-degree polynomial $(x - 0.5)^2 \cdot (0.1 + (x + 0.65)^2)$ with uniformly
distributed random input values $x_i \in (-1, 1)$. We use a 1-3-1 net and 200 data samples. The population
size is $N_p = 80$ for all DE-based methods and $N_p = 48$ for all CMA-ES-based methods. Fig. 9 shows the
resulting convergence curves and box plots for the learning process. For the DE-family, the Kruskal-Wallis
test showed no significant difference in means. In contrast, according to the Wilcoxon tests, the difference
between the means of CMA-ES and CMA-ES-SB narrowly misses significance, with a corresponding p-value of 0.08.
The other means are significantly different. All DE variants reach the same low error, with DE-SB showing
the fastest decrease in error. As for the CMA-ES variants, CMA-ES fails to reach a low error in a few runs,
which leads to a larger mean error. In contrast, CMA-ES-SB proves to be more robust and reaches
a relatively low error in all runs.

Figure 9: Convergence curves for regression by DE (left) and CMA-ES (right) using the syn5 dataset.
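For reference, a data set of this kind can be generated with a few lines of Python. The sketch assumes that both exponents of the polynomial are 2 (consistent with a fourth-degree polynomial) and adds the noise of Eqn. (48); the function names and the seed are illustrative.

```python
import random


def syn5_target(x):
    """Noise-free target, assuming the form (x - 0.5)^2 * (0.1 + (x + 0.65)^2)."""
    return (x - 0.5) ** 2 * (0.1 + (x + 0.65) ** 2)


def make_syn5(n_samples=200, noise_std=5e-3, seed=0):
    """Draw inputs uniformly from [-1, 1] and add zero-mean Gaussian noise."""
    rng = random.Random(seed)
    # random.uniform may return an endpoint; this is irrelevant in practice
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    ys = [syn5_target(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys
```

The normalization to zero mean and unit variance described in Section 6.1 would be applied to the returned lists afterwards.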
6.2.2 Dataset sinc

The sinc dataset is generated by a sinc function with uniformly distributed random input values
$x_i \in (-1, 1)$. We use a 1-5-1 net and 200 data samples. The population size is $N_p = 120$ for all DE-based
methods and $N_p = 400$ for all CMA-ES-based methods. Fig. 10 shows the resulting convergence curves and box plots
for the learning process. According to the Wilcoxon tests, all pairwise differences are significant. DE-SB
clearly outperforms DE and DE-INV-SB. Similarly, CMA-ES-SB is the fastest among the CMA-ES-based
methods.

Figure 10: Convergence curves for regression by DE (left) and CMA-ES (right) using the sinc dataset.
6.2.3 Dataset inc-sinc

The inc-sinc dataset is generated by the function | + '^"\oa;^ with uniformly distributed random input values 
$x_i \in (-1, 1)$. We use a 1-5-1 net and 200 data samples. The population size is
$N_p = 144$ for all DE-based methods and $N_p = 400$ for all CMA-ES-based methods. Fig. 11 shows the resulting
convergence curves and box plots for the learning process. According to the Wilcoxon tests, all pairwise
differences are significant. Interestingly, the global optimum invariant symmetry breaking approach leads to an
improvement for DE (DE-INV-SB), but shows inferior performance on CMA-ES (CMA-ES-INV-SB). This indicates
that symmetry breaking approaches should be specific to the selected global optimization method. Again, DE-SB and
CMA-ES-SB are the fastest methods.

Figure 11: Convergence curves for regression by DE (left) and CMA-ES (right) using the inc-sinc dataset.
6.2.4 Dataset sinc2d

The sinc2d dataset is generated by the function '^'^L jj with uniformly distributed random input values 
$x_i \in (-1, 1)^2$. We use a 2-3-1-3-1 net and 1000 data samples. The population size is
$N_p = 96$ for all DE-based methods and $N_p = 1000$ for all CMA-ES-based methods. Fig. 12 shows the resulting
convergence curves and box plots for the learning process. According to the Wilcoxon tests, all pairwise
differences are significant, except the difference between CMA-ES and CMA-ES-INV-SB. The proposed symmetry
breaking approach shows a very clear impact on the CMA-ES variants. While CMA-ES and CMA-ES-INV-SB
completely fail to solve this problem, CMA-ES-SB successfully trains the ANN in the majority of the 50 runs.

Figure 12: Convergence curves for regression by DE (left) and CMA-ES (right) using the sinc2d dataset.
6.2.5 Dataset sinc3d



The sinc3d dataset is generated by the function '^'^L jj with uniformly distributed random input values 



$x_i \in (-1, 1)^3$. We use a 3-4-1-4-1 net and 1000 data samples. The population size is
$N_p = 120$ for all DE-based methods and $N_p = 1000$ for all CMA-ES-based methods. Fig. 13 shows the resulting
convergence curves and box plots for the learning process. According to the Wilcoxon tests, all pairwise
differences are significant. Again, DE-SB and CMA-ES-SB are the fastest methods. This time, in contrast to the
previous experiments, CMA-ES-INV-SB clearly outperforms CMA-ES.

Figure 13: Convergence curves for regression by DE (left) and CMA-ES (right) using the sinc3d dataset.

6.3 Autoencoding problems 

In this section, all $d$-dimensional data samples lie on an $s$-dimensional set, where $s < d$. As a result, the data
can be described, or 'encoded', by an $s$-dimensional subset. Conversely, there is also an $s$-D to $d$-D
mapping to 'decode' the data. The task is to approximate both the encoding and the decoding mapping by an
ANN. As in the case of the regression problems, the performance is compared on the training set only.

6.3.1 Dataset autoenc-circle

In this problem, the data samples lie on a 2-D circle centered at the origin with radius one. We use a
2-5-3-2-1-2-3-5-2 net and 200 data samples to encode from 2-D to 1-D and decode back to 2-D. The population
size is $N_p = 64$ for all DE-based methods and $N_p = 4000$ for all CMA-ES-based methods. Fig. 14 shows
the resulting convergence curves and box plots for the learning process. All pairwise differences prove to be
statistically significant. The proposed symmetry breaking approach improves the training in both methods.
For CMA-ES-SB, the difference is particularly pronounced.

Figure 14: Convergence curves for regression by DE (left) and CMA-ES (right) using the autoenc-circle dataset.

6.3.2 Dataset autoenc-spiral

In this problem, the data samples lie on a 3-D spiral with radius one, defined by

$$(\cos(\phi), \sin(\phi), \phi), \quad \phi \in [0, 6\pi].$$

We use a 3-1-3-4-7-3 net and 1000 data samples to encode from 3-D to 1-D and decode back to 3-D. The
population size is $N_p = 80$ for all DE-based methods and $N_p = 400$ for all CMA-ES-based methods. Fig. 15
shows the resulting convergence curves and box plots for the learning process. All pairwise differences prove
to be statistically significant.

Figure 15: Convergence curves for regression by DE (left) and CMA-ES (right) using the autoenc-spiral dataset.
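Samples on this spiral can be generated as in the following Python sketch. It assumes that the angle $\phi$ is drawn uniformly from $[0, 6\pi]$, which the text does not state; the function name and the seed are illustrative.

```python
import math
import random


def make_spiral(n_samples=1000, seed=0):
    """Return points (cos(phi), sin(phi), phi) with phi in [0, 6*pi]."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_samples):
        phi = rng.uniform(0.0, 6.0 * math.pi)  # assumed sampling scheme
        points.append((math.cos(phi), math.sin(phi), phi))
    return points
```

Each sample is fully determined by the single parameter $\phi$, which is why a 1-D code suffices for the autoencoder.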
6.3.3 Dataset autoenc-sphere

In this problem, the data samples lie on a 3-D sphere centered at the origin with radius one. We use a
3-8-5-2-5-8-3 net and 1000 data samples to encode from 3-D to 2-D and decode back to 3-D. The population
size is $N_p = 96$ for all DE-based methods and $N_p = 1000$ for all CMA-ES-based methods. Fig. 16 shows
the resulting convergence curves and box plots for the learning process. All pairwise differences prove to be
statistically significant. Clearly, DE-SB and CMA-ES-SB are significantly faster than the other methods.

Figure 16: Convergence curves for regression by DE (left) and CMA-ES (right) using the autoenc-sphere
dataset.

6.4 Classification problems 

In classification problems, the data samples are divided into a training set, a validation set and a test set. All
three sets are generated by random selection of samples. A winner-takes-all scheme is applied to distinguish
different classes, i.e., given an input, the ANN-output component with the greatest value determines the
class. In order to improve generalization, the classification performance measures on the training and test sets
are updated only when the classification performance on the validation set improves.
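The winner-takes-all rule and the validation-gated bookkeeping can be sketched in Python as follows. The class and function names are hypothetical, `net` stands for any callable returning the list of ANN outputs, and each sample is assumed to be an `(input, class_index)` pair.

```python
def predict_class(outputs):
    """Winner-takes-all: index of the largest output component."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


def error_rate(net, theta, samples):
    """Fraction of misclassified (input, class_index) pairs."""
    wrong = sum(1 for x, label in samples
                if predict_class(net(theta, x)) != label)
    return wrong / len(samples)


class ValidationGatedRecord:
    """Stores training and test error rates, refreshed only when the
    validation error rate improves."""

    def __init__(self):
        self.best_validation_error = float("inf")
        self.train_error = None
        self.test_error = None

    def update(self, net, theta, train, validation, test):
        val_error = error_rate(net, theta, validation)
        if val_error < self.best_validation_error:
            self.best_validation_error = val_error
            self.train_error = error_rate(net, theta, train)
            self.test_error = error_rate(net, theta, test)
            return True
        return False
```

Reporting the test error only at validation improvements avoids selecting parameters by looking at the test set.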
6.4.1 Dataset: Two-Circles

In this problem, the 2-D data domain $[-1, 1]^2$ is divided into two parts: the union of the areas of two
circles and its complement. Hence, there are two classes: samples which lie inside either circle and samples
which lie outside both circles. One circle is specified by center $(0.5, 0.5)$
and radius $r_1 = 0.39894$, and the other circle by center $(-0.5, -0.5)$ and the same radius $r_2 = 0.39894$. We use a
2-4-2-4-2 net with 400 samples each for the training, validation and test sets, for a total of 1200 samples. The
population size is $N_p = 80$ for all DE-based methods and $N_p = 400$ for all CMA-ES-based methods. Fig. 17 and 18
show the resulting convergence curves and box plots for the learning process. All pairwise differences prove
to be statistically significant. It can be seen that DE-SB and CMA-ES-SB again show the best performance.

Figure 17: Classification error rates over ANN-evaluations on the Two-Circles dataset using the DE-variants.

Figure 18: Classification error rates over ANN-evaluations on the Two-Circles dataset using the CMA-ES-variants.
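The class assignment of the Two-Circles problem can be written down directly. In the Python sketch below, the class indices (1 for inside, 0 for outside), the uniform sampling of the domain and the function names are assumptions made for illustration.

```python
import random

CENTERS = ((0.5, 0.5), (-0.5, -0.5))
RADIUS = 0.39894


def two_circles_label(x, y):
    """Return 1 if (x, y) lies inside either circle, else 0 (assumed encoding)."""
    inside = any((x - cx) ** 2 + (y - cy) ** 2 <= RADIUS ** 2
                 for cx, cy in CENTERS)
    return 1 if inside else 0


def make_two_circles(n_samples=400, seed=0):
    """Draw points uniformly from [-1, 1]^2 and label them."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        samples.append(((x, y), two_circles_label(x, y)))
    return samples
```

With this radius each circle covers an area of about 0.5, so under uniform sampling roughly a quarter of the domain belongs to the inside class.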
6.4.2 Dataset: Two-Spirals

This problem [17] contains 2-D data samples from two spirals on the plane, both starting at the origin and
winding around each other. The task is to classify each data sample by deciding to which spiral it belongs.
We use a 2-8-3-1-3-8-2 net, 114 samples for the training set, 40 samples for the validation set and
40 samples for the test set. The population size is $N_p = 120$ for all DE-based methods and $N_p = 1000$
for all CMA-ES-based methods. Fig. 19 and 20 show the resulting convergence curves and box plots for
the learning process. On the training and test set, the DE and DE-INV-SB mean results are not statistically
significantly different. Furthermore, on the test set, the CMA-ES and CMA-ES-INV-SB mean results are not
significantly different. All other pairwise differences prove to be statistically significant. DE-SB
and CMA-ES-SB continue to show the best results.

Figure 19: Classification error rates over ANN-evaluations on the Two-Spirals dataset using the DE-variants.

Figure 20: Classification error rates over ANN-evaluations on the Two-Spirals dataset using the CMA-ES-variants.

6.4.3 Dataset: Digits

This problem [1] deals with the recognition of handwritten digits, which results in a classification problem
with 10 classes. The data is generated by asking several writers to write 250 digits in random order inside
boxes of 500 by 500 tablet pixel resolution. There are 16 features extracted from the digitized data. We use
a 16-8-3-10-10 net and 1000 data samples each for the training set, the validation set and the test set.
All pairwise differences prove to be statistically significant. Again, DE-SB and CMA-ES-SB are the fastest
methods.

Figure 21: Classification error rates over ANN-evaluations on the Digits dataset using the DE-variants.

Figure 22: Classification error rates over ANN-evaluations on the Digits dataset using the CMA-ES-variants.






6.5 Ideal separation 

In this section, we compare the ideal separation to the proposed approximation. Since the complexity of
the brute force method for the ideal separation is exponential, we restrict the experiments to the small networks
used in the problems syn5, sinc and inc-sinc. It can be seen that the results are almost identical.

Figure 23: Comparing DE-SB with DE using ideal separation by brute force symmetry breaking (DE-SB-BF)
on the syn5, sinc and inc-sinc datasets.

Figure 24: Comparing CMA-ES-SB with CMA-ES using ideal separation by brute force symmetry breaking
(CMA-ES-SB-BF) on the syn5, sinc and inc-sinc datasets.

7 Conclusions

The problem of symmetries in the ANN parameter space is well known and considerably complicates
the training of ANNs. However, a detailed investigation of this problem for Evolutionary Algorithms
other than Genetic Algorithms is missing in the literature. Furthermore, there are contradictory results
about the efficacy of symmetry breaking methods in the performance of the global search. We show that
a possible explanation for this situation is the use of symmetry breaking methods which are invariant to
the global optimum and therefore can only be effective on a limited number of problems. Furthermore, we
show theoretically and illustrate experimentally that the application of global optimum invariant symmetry
breaking may even lead to inferior performance. To circumvent these problems, we propose
global optimum variant symmetry breaking approaches for Differential Evolution (DE) and Covariance Matrix
Adaptation Evolution Strategies (CMA-ES), which are two popular, robust and state-of-the-art global
optimization methods.

Experimental studies conducted on fixed topology feedforward neural networks indicate a significant 
improvement over standard DE and CMA-ES techniques in terms of global convergence speed. Further 
comparisons of the proposed approach with a common global optimum invariant symmetry breaking approach 
support our hypotheses. 

Based on the obtained results, we conclude that other global optimization based methods may also benefit 
from the use of the proposed global optimum variant symmetry breaking. Further research is required to 
adapt the proposed approach to other techniques to improve their performance. 

The proposed method can be tested and verified using the open source C++ Monte Carlo Machine
Learning Library (MCMLL), which is available under the GNU GPLv2 license. The website of the library
can be found at mcmll.sourceforge.net. The project website is available at sourceforge.net/projects/mcmll.